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ABSTRACT 

We present Wavemoth, an experimental open source code for computing scalar spherical harmonic 
transforms (SHTs). Such transforms are ubiquitous in astronomical data analysis. Our code performs 
substantially better than existing publicly available codes due to improvements on two fronts. First, 
the computational core is made more efficient by using small amounts of precomputed data, as well as 
paying attention to CPU instruction pipelining and cache usage. Second, Wavemoth makes use of a fast 
and numerically stable algorithm based on compressing a set of linear operators in a precomputation 
step. The resulting SHT scales as 0(L 2 log 2 L) for the resolution range of practical interest, where L 
denotes the spherical harmonic truncation degree. For low and medium-range resolutions, Wavemoth 
tends to be twice as fast as libpsht, which is the current state of the art implementation for the 
HEALPix grid. At the resolution of the Planck experiment, L ~ 4000, Wavemoth is between three and 
six times faster than libpsht, depending on the computer architecture and the required precision. Due 
to the experimental nature of the project, only spherical harmonic synthesis is currently supported, 
although adding support for spherical harmonic analysis should be trivial. 

Subject headings: Methods: numerical 



1. BACKGROUND 



The spherical harmonic transform (SHT) is the spher- 
ical analog of the Fourier transform, and is an essential 
tool for data analysis and simulation on the sphere. A 
scalar field f(8, <f>) on the unit sphere can be expressed as 
a weighted sum of the spherical harmonic basis functions 



Yf, 



E 



r, 



(1) 



The coefficients ai m contain the spectral information of 
the field, with higher I corresponding to higher frequen- 
cies. In calculations the spherical harmonic expansion is 
truncated for I > L, and the spherical field represented 
by 0(L 2 ) grid samples. Computing the sum above is 
known as the backward SHT or synthesis, while the in- 
verse problem of finding the spherical harmonic coeffi- 
cients ai m given the field / is known as the forward SHT 
or analysis. 

In order to compute an SHT, the first step is nearly 
always to employ a separation of sums, which we re- 
view in Section 2.3, to decrease the cost from 0(L 4 ) to 
0(L 3 ). We will refer to codes that take no measures be- 
yond this to reduce complexity as brute-force codes. Of 
these, HEALPix (Gorski et al. 2005) is one very widely 
used package, in particular among CMB researchers. 

Recently, the libpsht package (Reinecke 2011) halved 
the computation time with respect to the original 
HEALPix implementation, simply through code opti- 
mizations. As of version 2.20, HEALPix uses libpsht 
as the backend for SHTs. Other packages using the 
brute-force algorithm include S 2 HAT (Hupca et al. 2010; 
Szydlarski et al. 2011), focusing on cluster parallelization 
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and implementations on the GPU, as well as GLESP 
(Doroshkevich et al. 2005) and ssht (McEwen & Wiaux 
2011), focusing on spherical grids with more accurate 
spherical harmonic analysis than what can be achieved 
on the HEALPix grid. 

The discovery of Fast Fourier Transforms (FFTs) has 
been all-important for signal analysis over the past half 
century, and there is no lack of high quality commercial 
and open source libraries to perform FFTs with stunning 
speed. Unfortunately, the straightforward divide-and- 
conquer FFT algorithms do not generalize to SHTs, and 
research in fast SHT algorithms has yet to reach maturity 
in the sense of widely adopted algorithms and libraries. 

The libftsh library (Mohlenkamp 1999) uses local 
trigonometric expansions to compress the spherical har- 
monic linear operator, resulting in a computational scal- 
ing of 0(L 5 / 2 logL) in finite precision arithmetic. Sphar- 
monicKit (Healy et al. 2003) implements a divide-and- 
conquer scheme which scales as 0(L 2 log L). We com- 
ment further on these in Section 4.4. Other algorithms 
have also been presented but either suffer from problems 
with numerical stability, are impractical for current reso- 
lutions, or simply lack publicly available implementations 
(e.g., Suda k Takami 2002; Kunis k Potts 2003; Rokhlin 
k Tygert 2006; Tygert 2008, 2010). 

We present Wavemoth 2 , an experimental open source 
implementation of the algorithm of Tygert (2010). This 
algorithm has several appealing features. First, it is sim- 
ple to implement and optimize. Second, it is inherently 
numerically stable. Third, its constant prefactor is rea- 
sonable, yielding substantial gains already at L ~ 2000. 
The accuracy of the algorithm is finite, but can be ar- 
bitrarily chosen. For any given accuracy, the computa- 
tional scaling is 0(L 2 log 2 L), but lowering the requested 
accuracy makes the constant prefactor smaller. 

2 http://github.com/wavemoth; commit 59ec31b8 was used to 
produce the results of this paper. 
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We stress that our work consists solely in providing an 
optimized implementation. While we review the basics 
of the algorithm in Section 3, Tygert (2010) should be 
consulted for details and proofs. We have focused in 
particular on the HEALPix grid, and use libpsht as our 
baseline for comparisons. However, all methods work 
equally well for any other grid with iso-latitude rings. 

Section 2 reviews SHTs in more detail, as well as the 
computational methods that are widely known and used 
across all popular codes. Section 3 reviews the algorithm 
of Tygert (2010) and how we have adapted it to our pur- 
poses. Section 4 focuses on the high-level aspects of soft- 
ware development and provides benchmarks, while an 
appendix provides the low-level implementation details. 

2. BASELINE ALGORITHMS 

2.1. The spherical harmonic basis functions 

We use the convention that points on the sphere are 
parameterized by a co-latitude e [0,7r], where corre- 
sponds to the "north pole", and a longitude <j) G [0, 2w). 
The spherical harmonic basis functions Yg m (6,(f)) can 
then be expressed in terms of the associated Legendre 
functions P™(z). Assuming to > 0, we have 



Ye m (9, 
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(2) 



where we define the normalized associated Legendre func- 
tion Pp. Our definition follows that of Press et al. 



(2007); the normalization differs by a factor of yl/2 
from the one in Tygert (2010). 

Note that while the spherical harmonics Yg m and the 
coefficients ae m are complex, Pp is real for the argu- 
ment range of interest. For negative to, the symmetry 
Ye,-m = ( — l) m ^m can be used, although this is only 
needed for complex fields. Wavemoth only supports real 
fields, which have spherical harmonic expansions obeying 
a tm = {-l) m a}_ m . 

2.2. Discretization and the forward transform 

For computational work one has to assume that one 
is working with a band-limited signal, so that ag m = 
when t > L. The SHT synthesis is then given simply by 
evaluating equation (1) in a set of points on the sphere. 

The opposite problem of computing at m given 
f(9j,(pj), namely spherical harmonic analysis, is less 
straightforward. In the limit of infinite resolution, we 
have 

a tm = y/(0,0)r/ m (0,0)dO, (3) 

where cZil indicates integration over the sphere. This 
follows easily from the orthogonality property, 



/ 
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(4) 



There is no canonical way of choosing sample points on 
the sphere. The simplest grid conceptually is the equian- 
gular grid. Doroshkevich et al. (2005) and McEwen & 
Wiaux (2011) describe grids that carries the orthogonal- 
ity property of the continuous spherical harmonics over 



to the discretized operator. In contrast, the HEALPix 
grid (Gorski et al. 2005) trades orthogonality for the 
property that each pixel has the same area, which is con- 
venient for many operations in the pixel basis. 

Independent of what grid is chosen, a natural approach 
to spherical harmonic analysis is to use a quadrature rule 
with some weights Wj , so that 



(5) 



On the HEALPix grid the numerical accuracy of this 
approach is limited, but it is still the most common pro- 
cedure. 

Some real world signal analysis problems do not need 
the forward transform at all. In the presence of measure- 
ment noise in the pixel basis, one can argue that the best 
approach is not to pull the noise part of the signal into 
spherical harmonic basis at all. For instance, consider 
the archetypical CMB data model, 



d = Ys + n, 



(6) 



where d represents a vector of pixels on the sky with 
observed data (not necessarily the full sky), s represents 
our signal of interest in spherical harmonic basis, and 
n represents instrumental noise in each pixel. Spherical 
harmonic synthesis is denoted Y; note that equation (1) 
describes a linear operator and can be written f = Ya. 

If we now assume that s and n are Gaussian random 
vectors with vanishing mean and known covariance ma- 
trices S and N, respectively, then the maximum likeli- 
hood estimate of the signal is given by 



s = (S- 1 + Y^N~ 1 Y)~ 1 Y^N~ 1 d, 



(7) 



with s in spherical harmonic basis. This system can be 
solved with reasonable efficiency by iterative methods. 
Note that we are here only concerned with the effect 
of Y as a non-invertible projection, and that no spheri- 
cal harmonic analysis is ever performed, only the adjoint 
synthesis. Thus, neither the non-orthogonality caused 
by the HEALPix grid, nor masking out large parts of 
the sky, is a concern. See Eriksen et al. (2008) and ref- 
erences therein for more details on this technique in the 
context of CMB analysis. 

2.3. Applying the Fast Fourier Transform 

The first step in speeding up the spherical harmonic 
transform beyond the 0(L 4 ) brute-force sum is a simple 
separation of sums. For this to work well, pixels must 
be arranged on a set of iso-latitude rings, with equidis- 
tant pixels within each ring. All grids in use for high- 
resolution data has this property. 

We show the case for SHT synthesis; analysis can be 
treated in the same way. Starting from equation (1), we 
have, for pixel j within ring k, and with Zk = cosflfc, 



f(6k,<f>k,j) = X! 

m=—L 
L 

^2 1rn{zk)e 



0,l m PP{zk) 

t=\m\ 



(8) 
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where we introduce q m (zk)- Assuming that ring k con- 
tains Jfc pixels, their equidistant longitude is given by 

<t>k,j = 4>k,o + (9) 

Jk 

Since e lx has period 2tt, and since q m (z k ) — whenever 
|m| > L, we find that 

L ,7 fc -l 

E 1k, m e im ^ = J) ^(^)e 27rj</Jfc (10) 

m=-L J=0 

with 

oo 

Tj (z k )= ]T ^W^K^ ^'^- (11) 

i— — oo 

Thus one can phase-shift the coefficients q m {zk) to match 
the ring grid, wrap around or pad with zeros, and per- 
form a regular backward FFT. The symmetries of the 
spherical harmonic coefficients of a real field carry over 
directly to the Hcrmitian property of real Fourier trans- 
forms. 

This separation of sums represents a first step in speed- 
ing up the SHT, and is implemented in all packages for 
high-resolution spherical harmonic transforms. 

2.4. Legendre transforms and even/odd symmetry 

The function q m {z) introduced in equation (8) is known 
as the (Associated) Legendre transform of order m, 

L 

q m (zk) = ^2 P" l (z k )a em , (12) 

assuming m > 0. The following symmetry cuts the arith- 
metic operations required in a SHT in half, as long as the 
spherical grid distributes the rings symmetrically around 
the equator. For any non- negative integer n, the func- 
tions Pm+2n( z ) are even an( l Pm+2n+l( z ) ar e odd. We 

define q^ en and q^ d so that g™° contains the even- 
numbered and q^ d the odd-numbered terms of equation 
(12), and so that 

q m (z)=q^(z) + q°? d (z). (13) 

Then, since q^ en and q^ d are weighted sums of even and 
odd functions, respectively, they are themselves even and 
odd, so that q m (— z) can be computed at the same time 
essentially for free, 

q m (-z)=q e ™(z)-q°J d (z). (14) 

For spherical harmonic analysis, one uses the orthogo- 
nality property. Assuming m > 0, 

J PP(z)PP(z)dz = 5 u ,, (15) 

so that 

a tm = J PT{z)q m (z)dz. (16) 

As discussed in Section 2.2, the resulting quadrature used 
in calculations can be exact or approximate, depending 
on the placement of the pixel rings. One can also in this 
case cut computation time in half by treating even and 
odd I — m separately. 



3. FAST LEGENDRE TRANSFORMS 

As the Fourier transform part is essentially a solved 
problem, efforts to accelerate SHTs revolve around 
speeding up the Legendre transforms. Let us write equa- 
tion (12) as 

q = A T a, (17) 

where we leave m and the odd versus even case implicit. 
For a full SHT, such a product must be computed for 
each of 2(L+1) different A matrices. The backwards Leg- 
endre transform required for spherical harmonic analysis 
is similarly 

a = Aq, (18) 

give or take a set of quadrature weights. 

The idea of Fast Legendre Transform algorithms is to 
compute equations (17) and (18) faster than 0(LN v - lnE ). 
The approach of Tygert (2010) is to factor A as a prod- 
uct of block-diagonal matrices in a precomputation step, 
which can significantly reduce the number of elements in 
total. This technique is known as butterfly compression, 
and was introduced by Michielssen & Boag (1996). The 
accuracy of the compression is tunable, but even nearly 
loss-less compression with close to double precision ac- 
curacy is able to yield significant gains as the resolution 
increases. We review the algorithm below, but stress 
again that the reader should consult Tygert (2010) for 
the full details. The butterfly compression technique was 
introduced by, 

3.1. The interpolative decomposition 

The core building block of the compression algorithm 
is the Interpolative Decomposition (ID), described in 
Cheng et al. (2005). Assume that an m x n matrix A 
has rank k, then the ID is 

A = A (fe) A, (19) 

The matrix A^ fe ) , known as the skeleton matrix, consists 
of k columns of A, whereas A, the interpolation matrix, 
interpolates the eliminated columns from the ones that 
are preserved. Of course, k of the columns of A must 
form the identity matrix. 

The ID is obviously not unique; the trick is to find a 
decomposition that is numerically stable. The algorithm 
of Cheng et al. (2005) finds an interpolation matrix A 
so that no element has absolute value greater than 2, all 
singular values are larger than or equal to 1, and the 
spectral norm is bounded by \J 'Ak(n — k) + 1. The nu- 
merical precision of the decomposition is tunable, as the 
decomposition found by the algorithm satisfies 

||A - A«A|| < v/4fc(n -k) + lo-fc+i (20) 

where a k +i is the (k + 1) greatest singular value of A. 
Implementing lossy compression is simply a matter of 
reducing the accuracy required of the IDs we use. 

3.2. Butterfly matrix compression 

We now use the ID recursively to factor the matrix A. 
After applying p levels of compression, we have 

A = RS p P p _ 1 S p _i---P 2 S 2 P 1 S 1 , (21) 
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Level 2 







R 2 



Level 3 




R 2 R I S 3 

Figure 1. Illustration of the butterfly matrix compression scheme. On the first level, we use the Interpolative Decomposition to compress 
sub-blocks of the matrix A and produce the factorization A = R/Si, where all blocks in R'j have full rank. We then proceed by permuting 
the columns of R' x so that A = R1P1S1 , in order to create new rank-deficient blocks. The contents of the Si matrix is saved as precomputcd 
data, while we carry Ri along for further compression on the next level. The algorithm continues in this fashion until the residual matrix 
R only consists of a single diagonal of full-rank blocks. The final factorization becomes A = RS3P2S2P1S1. The permutations involved 
are known in the FFT literature as butterfly permutations; the "butterfly" can be seen twice in the pattern of Pi. 
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where R is a block-diagonal residual matrix containing 
elements that were not compressed, the Si are block- 
diagonal matrices containing compressed data, and the 
Pi are permutation matrices. See Figure 1 for an il- 
lustration. The structure of the permutations are very 
similar to the butterflies used in FFT algorithms, hence 
the name of the compression scheme. In fact, if one lets 
Si contain a specific set of 2 x 2-blocks on their diagonals 
one recovers the famous Cooley-Tukey FFT. In our case 
the blocks will be significantly larger, typically around 
150 x 150, although with much variation. 

We start by partitioning A into 2 P column blocks. The 
number of levels p is mainly determined by the number of 
columns in the matrix, so that the column blocks all are 
roughly of the same predetermined width. In our case, 
64 columns worked well. 

We then split each block roughly in half horizontally, 
and compress each resulting block using the ID, 



T i,i 
B ii 

-.(*) 



Tl,2 
Tl.2 



(T\ K { • T M ) {T[ k l • T 1;2 ) 
(B$.B M ) (B$.B li2 ) 



where the first subscript of each matrix refers to this 
being the first iteration of the algorithm. It is useful to 
write the above matrix as 



A = 



-.(*) 



-.(*) 



B l.l 



R (fe) 
U l,2 



Ti,i 
Bi.i 



Ti, 2 

Bi,2 



We denote the right matrix Si. It can not be further 
processed and its blocks are simply saved as precomputed 
data, making use of the fact that each block embeds the 
identity matrix in a subset of its columns. 

The left matrix can be permuted and further com- 
pressed. For some permutation matrix Pi we have: 



A = 



T (fc) 
-"■1,1 



T (fc) 
1,1 



B (fc) 
1,1 

T.(fc) 

1,2 



T (fe) 
A l,2 



g(fe) 
1,1 



D l,2 



g(fe) 
1,2 



Si 

PiSi. 



Then we join blocks horizontally, split them vertically, 
and compress each resulting block. For the top-left cor- 
ner we have 



rp(fe) rp(fe) 

1,1 1,2 



T 2 ,i 

B 2.1 



(rp( k ) 

\ ± 2,1 



T 2 ,i) 
B 2 ,i) 



(22) 



Applying this to all blocks in the matrix, we get 
(T?] • T 2 ,i) 



(B^j • B 2 ,i) 



(Bg 



-.(*) 



B 



(fe) 

2,1 



T 2 , 2 ) 

B 2 ,2) 

T (fe) 

A 2,3 



PiSi 



B 



(fe) 

2,3 



B 



(fe) 

2,2 



T 2 ,i 
B 2 ,i 



B 2 ,2 



PiSi. 



And so the scheme continues. For each iteration the num- 
ber of diagonals in the left matrix is halved, the number 
of blocks in each diagonal is doubled, and the height of 
each block is roughly halved. Eventually the left ma- 
trix consists only of a single diagonal band of blocks, 
and further compression is impossible. This becomes the 
residual matrix R of equation (21). 

The efficiency of the scheme relies on the non-trivial 
requirement that the and B( fe ) blocks are rank- 

deficient at every level of the algorithm. To get a handle 
on which matrices exhibit this behavior, we start with 
assuming the rank property, namely that any contigu- 
ous rectangular sub-block of A, up to the numerical pre- 
cision chosen, has rank proportional to the number of 
elements in the sub-block. That is, the rank does not 
depend on the location or shape of the block. Now, each 
time the butterfly algorithm joins two skeletons, such as 



-.(fe) rp(fc)! 

■ 1,1 



■ 1.2 J 



in equation (22), the resulting matrix has 
roughly 2k columns while spanning out a corresponding 
block of A of rank k. Therefore, half of the columns can 
be eliminated by applying the ID. Since the data volume 
is roughly halved at each compression level, and since 
Si at each level has 0{L) interpolative matrices of size 
roughly k x 2k = O(l), the resulting compressed rep- 
resentation of A has O(LlogL) elements. See Tygert 
(2010) for a more detailed argument. 

O'Neil et al. (2010) proves the rank property in the case 
of Fourier transforms and Fourier-Bessel transforms. It 
is however not proven in the case of associated Legendrc 
functions P™(z). Figure 2 shows our results for resolu- 
tions up to L ~ 130000; we discuss these results further 
in Section 4.3. 

3.3. Notes on interpolation 

Tygert (2008) describes an elegant and exact interpo- 
lation scheme which, in the case of the HEALPix grid 
and L = 2N s id e , reduces the number of required evalua- 
tion points for q m {zk) by 2/3. Although our conclusion 
was not to include this step in our code, we include a 
brief discussion in order to motivate our decision. 

We focus on the even Legendre functions, the odd case 
is similar. Let n be an integer such that L < m + 2n. 
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L 
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Figure 2. Efficiency of the butterfly compression scheme. Left panel: Estimated size of compressed data (solid) compared to uncompressed 
matrix size (dashed). A line proportional to 0(L 2 log V) (dotted) is shown for comparison. In each case we use a HEALPix grid with 
^sido = L/2. Right panel: A closer look on the computational scaling. The size of the compressed data is shown divided by 0(L 2 log 3 L) 
(dashed), 0(L 2 log 2 L) (solid) and 0(L 2 logL) (dotted), using an arbitrary normalization. 



The function Pm+2n( x ) nas n roots in the interval (0, 1), 
which we denote Z\,...,z n . Now, assuming that we have 
evaluated q^ cn in these roots, we can interpolate to any 
other point y G (—1,1) by using the formula 



C cn (y) = ^( y )^-J^c on (^), 



ir 



(23) 



for some precomputed weights co(y) and "f(zi). The proof 
relies on the Christoffel-Darboux identity for the normal- 
ized associated Lcgcndre functions (Tygert 2008; Jakob- 
Chien & Alpert 1997). The Fast Multipole Method 
(FMM) allows the computation of equation (23) for p 
points with operation count of order 0(p + n) rather 
than 0(pn). The FMM was originally developed for ac- 
celerating TV-body simulations, but is here motivated al- 
gebraically. For more information about one-dimensional 
FMM we refer to Yarvin & Rokhlin (1999) and Dutt ct 
al. (1996). 

The reason we did not include this step in our code 
is that much of the interpolation is already embedded 
in the butterfly matrix compression. Consider for in- 
stance A sidc = 2048, L = 3A sidc and to = 2000. The 
full matrix A occupies 65 MiB when evaluated in the 
HEALPix co-latitude nodes, and only 49 MiB when eval- 
uated in the optimal nodes as described above. However, 
after compression the difference is only 10.4 MiB versus 
9.4 MiB. Thus the butterfly compression compensates, 
at least partially, for the over-sampling. Indeed, Mar- 
tinsson & Rokhlin (2007) use a strongly related matrix 
compression technique to implement the FMM itself. 

Interpolation also causes the precomputed data to be- 
come independent of the chosen grid and resolution. 
However, we found the constant prefactor in the FMM 
to be quite high, and including it only as a matter of con- 
venience appears to be out of the question for our target 
resolutions. Since the FMM has a linear computational 
scaling, the question should be revisited for higher reso- 
lutions. 

3.4. CPU and memory trade-offs 

So far we have focused on reducing the number of 
floating point operations (FLOPs). However, during the 



past decade the speed of the CPU has increased much 
more rapidly than the system memory bandwidth, so 
that in current multi-core computers it is easy to get in 
a situation where the CPUs are starved for data to pro- 
cess. When processing only one or a few transforms con- 
currently, the volume of the precomputed data is much 
larger than the volume of the maps being transformed, 
so that the limitation is moving the precomputed data 
over the memory bus, not processing power. Note that 
in the case of very many simultaneous transforms the 
problem is alleviated since the movement of precomputed 
data is amortized. Following in the footsteps of libpsht, 
and our own requirements in CMB analysis, we have re- 
stricted our attention to between one and ten concurrent 
transforms. While the butterfly algorithm probably per- 
forms well in the face of many concurrent transforms, it 
would require additional blocking and optimization be- 
yond what we have implemented, so that movement of 
the working set in memory is properly amortized. Note 
that as each m is processed independently, the working 
set is only about 1/L of the total input. 

The considerations above motivates stopping compres- 
sion early, after a significant reduction in the floating- 
point operation count has been achieved, but before the 
size of the precomputed data becomes too large (see Fig- 
ure 3). Butterfly compression has the convenient feature 
that the blocks in the residual matrix R consists of con- 
tiguous slices from columns of A. By orienting A so that 
rows are indexed by I and columns by z, the elements 
of the residual blocks can be computed on the fly from 
three-term recurrence formulas for the associated Legen- 
dre functions. We return to this topic in Section A. 2. 

As an example, consider N s ^ e = 2048 and m = 400. 
The uncompressed matrix A takes 64 MB in double pre- 
cision. This can be compressed to 20% of the original 
size by using five levels of compression, with the uncom- 
pressed residual R accounting for about 13% of the com- 
pressed data. If one instead stops after three levels of 
compression, then although the size of the compressed 
data has now grown to 24% of the original, 57% of this 
is made out of elements in R. Since one only needs to 
store two elements for every column of 512 elements in R 
and can generate the rest on the fly, stopping comprcs- 



Wavemoth: Fast spherical harmonic transforms 



7 




1 2 3 4 5 6 

Number of compression iterations 



Figure 3. Effect of each level of butterfly compression. The size 
of the compressed data (solid) is the sum of elements in residual 
uncompressed blocks in R (dashed) and the interpolation matrices 
Si (dotted). While R can be generated on the fly during trans- 
forms, the S; needs to be stored as precomputed data, so that the 
choice of compression level is a trade-off between CPU use and 
the size of the precomputed data. Parameters for this figure are 
Wsidc = 2048, L = SJVside, m = 0, and the initial chunk size 32 
columns. 

sion after three levels reduces the memory bus traffic and 
size of precomputed data by about 40%, at the cost of 
some extra CPU instructions. Note that the brute-force 
codes may simply be seen as the limit of zero levels of 
compression. 

4. IMPLEMENTATION & RESULTS 
4.1. Technology 

The Wavemoth library is organized in a core part and 
an auxiliary part. The core is primarily written in C and 
contains the routines for performing spherical harmonic 
transforms. The auxiliary shell around the core is writ- 
ten as a Python package, and is responsible generating 
the precomputed data using the butterfly compression 
algorithm, as well as the regression and unit tests. 

By writing the core in pure C we remain close to the 
hardware, and make sure the library can be used with- 
out Python. C remains the easiest language to call from 
other languages such as Fortran, CH — h, Java, Python, 
MATLAB, and so on. By using Python in the auxiliary 
support code we accelerate development of the parts that 
are not performance critical, and make writing tests a 
pleasant experience. Being able to quickly write up unit 
tests is an indispensable tool, as it allows optimizing the 
C code iteratively without introducing bugs. Since indi- 
vidual pieces of the C core is tested, there is both a public 
API for end-users and a private API that is used from 
Python to test individual C routines in isolation. Much 
of the support code is implemented in Cython (Behncl 
et al. 2011), which bridges the worlds of Python and C. 

The C core depends on files containing precomputed 
data, a Fourier transform library and a BLAS library. 
For the latter two we use FFTW3 (Frigo & Johnson 2005) 
and ATLAS (Whaley et al. 2001), respectively. Parts of 
the Wavemoth core is written using templates in order 
to generate many slight variations of the same C rou- 
tine. We use Tempita 3 , a purely text-oriented templating 

3 http://pythonpaste.org/tempita/ 



language, and find this to be much more convenient for 
optimizing a computational core than the type-oriented 
templates of C++. During the precomputations, we use 
the open source Fortran 77 library ID 4 to compute the 
Interpolative Decomposition, and libpsht to generate the 
associated Legendre functions. 

Unlike libpsht, we have not focused on portability, and 
Wavemoth is only tested on 64-bit Linux with the GCC 
compiler on Intel-platform CPUs. Computational cores 
are written using SSE intrinsics and 128-bit registers. 
More work is needed for optimal performance on the lat- 
est Intel micro-architecture, which support 256-bit regis- 
ters, or on non-Intel platforms. Beyond that, we expect 
no hurdles in improving portability. 

4.2. Benchmarks 

We include benchmarks for two different systems 
with different memory bandwidth, as Wavemoth's per- 
formance is deeply influenced by this aspect of the 
hardware. Figure 4 presents benchmarks taken on a 
64-core 2.27 GHz Intel Xcon X7560 (Nehalem micro- 
architecture), which has a compute-to-bandwidth ratio 
of about 45:1. Figure 5 presents benchmarks taken on 
a 48-core 2.2 GHz AMD Opteron 6174. The compute- 
to-bandwidth ratio is in this case about 64:1, signifi- 
cantly worse than the Intel system 5 . The consequence 
is that butterfly compression gives less of an advan- 
tage, with only about four times speedup over libpsht 
at L = 4096, compared to the corresponding six times 
speedup achieved on the Intel system. In the case of ten 
simultaneous transforms, libpsht achieves a very consis- 
tent 2x speedup which Wavemoth is not able to fully 
match, as most of our tuning effort has been on the sin- 
gle transform path. 

The highest tested accuracy of e = 10~ 13 for the Leg- 
endre transforms was chosen because current codes using 
the HEALPix grid only agree to this accuracy on high 
resolutions (Reinecke 2011). 

An important aspect of the systems for our purposes is 
the non- uniform memory access (NUMA). On each sys- 
tem, the CPU cores are grouped into eight nodes, and 
the RAM chips evenly divided between the nodes. Each 
CPU only have direct access to RAM chips on the lo- 
cal node, and must go through a CPU interconnect bus 
to access other RAM chips. For consistent performance 
we need to ensure that Wavemoth distributes the pre- 
computed data in such a way that each CPU finds the 
data it needs in its local RAM chips. In the benchmarks 
we always use a whole number of nodes, so that com- 
putation power and memory bandwidth scale together. 
The exception is benchmarks using a single core, but in 
those cases, Wavemoth's precomputed data fits in cache 
anyway. 

Table 1 list the sizes of the precomputed data. To 
balance bandwidth and CPU requirements as described 
in Section 3.4, the precomputation code takes a param- 
eter p, specifying the cost of floating point operations 

4 http://cims.nyu.edu/~tygert/software.html 

5 The Intel system supports transfer of 13 billion numbers per 
second and has theoretical peak compute power 580 GFLOPS, us- 
ing all 64 cores. The AMD system supports transfer of 6.5 billion 
numbers per second and has theoretical peak compute power of 422 
GFLOPS, using all 48 cores. All numbers refer to double precision 
floating point. 




Figure 4. Benchmarks for full SHTs performed on the Intel system. The left pane shows timings for a single transform, the right for 
ten simultaneous transforms. We scale up the number of CPU cores together with the resolution. Each pane is divided into four partially 
overlapping segments corresponding to 1, 8, 16, 32, and 64 CPU cores, respectively (indicated by white/gray backgrounds and changes 
in line colors). The libpsht code (red triangles) is compared to Wavemoth (blue/black) with no compression (solid, circles), compression 
with precision 10 -13 (dashed, diamonds) and compression with precision 10 -8 (dotted, crosses). In each case we use a HEALPix grid 
with resolution iV s i<i e = L/2. Note for instance how both codes suffer from parallclization overhead at the transition from one to eight 
cores, but that libpsht suffers less and catches up with Wavemoth. For a single transform at high resolutions, the situation is the contrary, 
with Wavemoth parallelizing better at the jump from 16 to 32 cores and from 32 to 64 cores. We repeated each benchmark multiple times 
both with and without HyperThreading, and report the fastest wall clock time achieved multiplied with the number of CPU cores used 
and divided by the number of simultaneous transforms. Some 32-core timings for ten simultaneous transforms at L = 8192 could not be 
obtained due to memory limitations. The load time of the precomputed data from the hard drive is not included. 



Wavemoth: Fast spherical harmonic transforms 



9 




4096 



Figure 5. Benchmarks for full SHTs performed on the AMD system, using all 48 CPU cores. The libpsht code (red triangles) is compared 
to Wavemoth (blue) with no compression (solid, circles), compression with precision 10 — 13 (dashed, diamonds), compression with precision 
10 — 8 (dotted, crosses). Left pane shows a single transform and the right pane ten simultaneous transforms. In each case we use a HEALPix 
grid with resolution -/V sidc = L/2. The large speedup in the range L = 256. .1024 is in part due to Wavemoth scaling better to all 48 cores, 
and is closer to a 2x speedup when using fewer cores. We repeated each benchmark multiple times, and report the fastest wall clock time 
achieved multiplied with the number of CPU cores used and divided by the number of simultaneous transforms. 



Table 1 

Size of precomputed data 



L 


No comp. 


Intel (p 


= 7.5) 


AMD (p 


= 18) 


Precomputation time 






Tol. 10" 13 


Tol. 10~ 8 


Tol. 10" 13 


Tol. 10~ 8 


(CPU minutes) 


32 


130 KiB 










0.02 


64 


496 KiB 










0.03 


128 


2.0 MiB 


2.0 MiB 


2.0 MiB 


1.9 MiB 


1.9 MiB 


0.22 


256 


8.0 MiB 


8.0 MiB 


8.0 MiB 


7.1 MiB 


7.1 MiB 


2.6 


512 


27 MiB 


174 MiB 


187 MiB 


27 MiB 


27 MiB 


7.4 


1024 


102 MiB 


937 MiB 


988 MiB 


102 MiB 


170 MiB 


12 


2048 


389 MiB 


6.0 GiB 


5.8 GiB 


4.4 GiB 


4.3 GiB 


90 


4096 


1.5 GiB 


38 GiB 


35 GiB 


28 GiB 


27 GiB 


536 


8192 


5.8 GiB 


212 GiB 


208 GiB 






4380 



Note. — The precomputation time quoted is the wall time taken to compute at 
tolerance 10 -13 on the Intel system, multiplied by the number of CPU cores used. We 
use 1 core for L — 32, and then scale up gradually to 64 cores at L = 8192. The 
precomputed data is saved to a network file system. 



Table 2 

Samples of numerical accuracy 



L No 


compression 


Tolerance 10~ 13 Tolerance 10" 8 


8 


8.6e-16 




16 


1.4e-15 




32 


2.7c-15 




64 


5.8e-15 




128 


1.2e-14 




512 


5.1c-14 


4.7e-14 1.9e-09 


1024 


1.3c-13 


8.9e-14 2.4e-09 


2048 


2.7c-13 


1.7e-13 3.1e-09 


4096 


6.4e-13 


3.3e-13 3.7e-09 


8192 


2.2e-12 


6.6e-13 4.2e-09 


Note. — In each case 
sian sample is compared 


, the transform of a single Gaus- 
against libpsht double precision 



results. 
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in the bandwidth-intensive butterfly matrix application 
stage relative to the cost of floating point operations in 
the CPU-intensive brute-force Legendre transform stage. 
The parameter was then tuned for the single-transform 
case for L — 4096, resulting in optimal choices of p = 7.5 
on the Intel system and p = 18 on the AMD system. Per- 
forming the precomputations scales as 0(L 3 ). In the case 
of no compression, we still store the precomputed quan- 
tities necessary for the Legendre recurrence relations in 
memory, as described in the appendix. Loading this data 
from memory is not necessarily faster than computing it 
on the fly, but doing so saved some development time. 

All methods involved are numerically stable and well 
understood, so we do not include a rigorous analysis of 
numerical accuracy. Table 2 lists the relative error from 
transforming a single set of standard Gaussian coeffi- 
cients per configuration. We use the relative error 



\ 



JV pix 

E 

i=l 



(xi - yi) 2 / 



(24) 



where Xi denote the result of libpsht and yi the result of 
our code. The discrepancies in the no-compression, high- 
L cases are due to using a different recurrence for the 
associated Legendre functions, as described in Appendix 
A. 2. As we did not compare with higher precision results, 
it is not clear whether it is our code, libpsht, or both, 
that loose precision with higher resolution. Note that 
the input data to the butterfly compression is generated 
using libpsht. 

4.3. Higher resolutions 

Due to memory constraints we have not gone to higher 
resolutions than L ~ 8000. Instead, we provide esti- 
mates for the number of required floating point opera- 
tions. Tygert (2010) provide similar estimates, but focus 
on the behavior for the Legendre transform for single to 
rather than the full SHT. 

At each resolution, we compress A° dd for 20 different 
to, and fit the cost estimate 



p=0 



/3 p m log(l + to) 



(25) 



by least squares minimization in the parameters a, /?o, Pi 
and /?2- The final cost is then estimated by 



Ctotal 



2£< 

m=0 



(26) 



since A^ en and A^ d has almost identical behavior. The 
results can be seen in Figure 2. For L ~ 130000, the 
butterfly algorithm requires only 1% of the arithmetic 
operations of a brute force transform. The size of the 
precomputed data at this resolution is around 45 TiB in 
double precision, although this can be reduced by using 
the hybrid approach of Section 3.4. 

At low resolutions, the algorithm is bound by the 
0(L 3 ) operations of the brute-force Legendre transform. 
At high resolutions, the 0(L 2 log L) trajectory is clearly 
a better fit than the 0(L 2 log L) scaling conjectured by 
Tygert (2010). Note that the numerical evidence pre- 
sented in Tygert (2010) show that the average k increases 




256 



512 



1024 



2048 



Figure 6. Comparison of Wavemoth (black circles) with the algo- 
rithm of Mohlenkamp (1999) as implemented in libftsh (red trian- 
gles), with accuracy 10 -13 (solid) and 10 — 8 (dotted). Both codes 
are run on a single core. Only the Legendre transform part is 
benchmarked, as libftsh does not implement the full SHT. Wave- 
moth uses a HEALPix grid with 2L — 1 rings, while libftsh uses a 
Gaussian grid with 2L rings. The placement of the rings should 
make little difference to the performance of either code. 

monotonically with to, so it may indeed be the case that 
the rank property is not fully satisfied, or only satis- 
fied conditional on to. The benchmark results of Tygert 
(2010) seem to be in agreement with the 0(L 2 log 2 L) 
hypothesis as well. 

4.4. Comparison with other fast SHT algorithms 
A widely known scheme for fast SHTs is the 
0(L 2 log 2 L) transform of Healy et al. (2003), imple- 
mented in SpharmonicKit. It algebraically expresses a 
Legendre transform of degree L as a function of two Leg- 
endre transforms of degree L/2, resulting in a divide- 
and-conquer scheme similar to the FFT algorithms. Un- 
fortunately, the scheme is inherently numerically unsta- 
ble, and special stabilization steps must be incorporated. 
Also, it is restricted to equiangular grids, so that it can 
not be used directly with the HEALPix or GLESP grids. 
Wiaux et al. (2006) benchmarks SpharmonicKit against 
the original HEALPix implementation (pre 2.20) and 
find that it is almost three times slower at L = 1024. 
Keep in mind that libpsht, used in present releases of 
HEALPix, is about twice as fast as the original HEALPix 
implementation. Considering the above, we stop short of 
a direct comparison between Wavemoth and Spharmon- 
icKit. Note that while SpharmonicKit achieves much 
higher accuracy of an SHT round-trip than HEALPix 
does, this is an effect of the different sampling grids be- 
ing used, not of the computational method, and it is 
straightforward to extend the Wavemoth code to use the 
same grid as SpharmonicKit. 

Mohlenkamp (1999) uses a matrix compression tech- 
nique similar to the one employed in this paper, which is 
independent of the pixel grid chosen. A matrix related 
to the A of the present paper is locally approximated by 
truncated trigonometric series. The resulting SHT algo- 
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rithm scales as 0(L 5 / 2 logL). As shown in Figure 6, the 
code behaves very similarly to our code at medium reso- 
lution, as long as one do not require too much numerical 
accuracy. The size of the precomputed data is also of the 
same order, sometimes half and sometimes double that 
of Wavemoth's data. 

Note that libftsh appears to have potential for opti- 
mization for modern platforms, and this should be taken 
into account when comparing the algorithms. Due to its 
age, libftsh makes assumptions about 32-bit array sizes 
which prevents comparison at higher resolutions without 
porting libftsh to 64- bit. The libftsh code contains an im- 
plementation of the Legcndre transforms only, and not 
of the full spherical harmonic transforms. It should be 
straightforward to modify Wavemoth to use libftsh for 
its Legendre transforms in order to perform full SHTs 
using this algorithm. 

The compression scheme of Mohlcnkamp (1999) ap- 
pears to be very competitive for low accuracy trans- 
forms, but less so if higher precision is needed. It may 
be fruitful to hybridize the algorithms of Tygert (2010) 
and Mohlenkamp (1999) and use both together to com- 
press a single matrix. Even if that does not work, one 
can simply use whichever performs best for a given m. 

5. DISCUSSION 

There is significant potential in speeding up spherical 
harmonic transforms beyond the codes in popular use to- 
day. We achieved a 2x speedup at low and medium res- 
olutions simply due to restructuring how the brute-force 
computations are done, and believe there is potential for 
even more speedup if time is spent on profiling and micro- 
optimization. In particular, our code is under-optimized 
for multiple simultaneous transforms. 

At the highest resolutions in practical use in cosmol- 
ogy today, L ~ 4000, use of the butterfly compression 
is borderline. One the one hand, it does yield an addi- 
tional 2x speedup; potentially much more if one needs 
less accuracy. On the other hand, it requires between 
30 and 40 GiB of precomputed data in memory, and 
the transportation of that data over the memory bus 
for every set of transforms. The result is a delicate bal- 
ance between bandwidth and achieved speedup; for every 
number stored in the precomputed data, one might save 
40 arithmetic operations, but then again computation is 



much cheaper than accessing system memory on present- 
day computer architectures. 

In Section 3.3, we note the existence of interpolation 
schemes that cut the necessary sample points for brute- 
force codes by two thirds in the case of the HEALPix 
grid; although performing the interpolation step does 
not come for free. It seems that the speedup from such 
interpolation alone could on the same order as what 
the butterfly algorithm achieves for the current needs 
of CMB research. The advantage is that it does not re- 
quire nearly as much precomputed data, and is so much 
less architecture-dependent and easier to micro-optimize. 
In going forward we therefore anticipate spending more 
effort on direct interpolation schemes and less effort on 
matrix compression. For resolutions higher than those 
needed in CMB analysis, matrix compression schemes 
seem like the most mature option at the moment. 

We have not discussed spin-weighted spherical har- 
monic transforms, which are crucial to analyzing the po- 
larization properties of the CMB. However, Kostelec et 
al. (2000) and Wiaux ct al. (2007) describe how the trans- 
form of a polarized CMB map can be reduced to three 
scalar transforms. This would additionally help amor- 
tize the memory bus transfer of the precomputed data. 
Alternatively, it may be possible to compress the spin- 
weighted spherical harmonic operators. 

We consider Wavemoth an experimental code for the 
time being, and spherical harmonic analysis has been left 
out. This was done purely to save implementation time, 
and we know of no obstacles to implementing this using 
the same methods. The code also lacks support for MPI 
parallelization, although we expect adding such support 
to be straightforward. The only inter-node communica- 
tion requirement is a global transpose of q m {zk) between 
the Legendre transforms and the Fourier transforms. 

The author thanks S. K. Naess, H. K. Eriksen, M. 
Tygert, M. Reinecke and M. Mohlcnkamp for useful dis- 
cussions, and M. Omang and F. Hansen for lending the 
benchmark hardware. The author is funded by European 
Research Council grant StG2010-257080. The bench- 
mark hardware is funded by the Norwegian Defence Es- 
tates Agency and the Research Council of Norway. 



APPENDIX 
IMPLEMENTATION DETAILS 
Applying the compressed matrix representation to a vector 

On modern computers, the primary bottleneck is often to move data around. Fundamental design decisions were 
made with this in mind. Looking at the compressed representations of A in Section 3.2, the immediate algorithm 
that comes to mind for computing Ax or A T x is the breadth-first approach: First compute Six, then permute the 
result, then compute S 2 (PiSix), and so on. However, this leads to storing several temporary results for longer than 
they need to, since the rightmost permutations are very local permutations, and only the leftmost permutation is 
fully global. Therefore we traverse the data dependency tree set up by the permutations in a depth-first manner. 
The advantage of this approach is that it is cache oblivious when transforming a few vectors at the time. That is, it 
automatically minimizes data movement for any cache hierarchy, whereas breadth-first traversal will always drop to 
the memory layer that is big enough to hold the entire set of input vectors. Note that for transforming many maps 
at the same time, cache-size dependent blocking should be implemented in addition, but we have stopped short of 
this. Like Tygert (2010), we also do the compression during precomputation depth- first, which ensures that, per m, 
memory requirements go as O(LlogL) even though computation time go as 0(L 2 ). 

The core computation during tree traversal is to apply the interpolative matrices, e.g., Tx or T T x. Keep in mind 
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that the fc-by-n matrix T contains the fc-by-fc identity matrix in a subset of its columns; making use of this is important 
as it roughly halves the storage size and FLOP count. Given an ID T = tC'T, we can freely permute the rows of T, 
simply by permuting the columns of T^ fc ^ correspondingly We do this during precomputation to avoid the unordered 
memory usage pattern of arbitrary permutations. Instead, we can simply filter the input or output vectors into the 
part that hits the identity sub-matrix and the part that hits the dense sub-matrix. 

Efficient code for Legendre transforms 

As mentioned in Section 3.4, it is necessary to balance the amount of precomputed data to the memory bandwidth, 
so code is required to apply the residual blocks in R to vectors without actually storing R in memory. This means 
computing a cropped version of the Legendre transform, 

?'(^) = E Prn + 2k + t(Zj)a em , (Al) 

k— fc B tart 

where t = for the even transforms and t = 1 for the odd transforms. To compute P™ we use a relation that jumps 
two steps in I for each iteration (Tygert 2010): 

„ 7 2 _ jm _ c m ~ 

= (z 2 + a?){3TPr{z) + lTPr-2(z), 

with 

m= l (£-m+ !)(£ - m + 2)(l + m + l)(l + m + T) 
Cl V (2^+l)(2^ + 3) 2 (2£+5) 

and 

21(1 +1)- 2m 2 -1 
1 (2£-l)(2£ + 3) ' 

This recurrence relation requires five arithmetic operations per iteration, as opposed to a more widely used relation 
which takes one step in I and only needs four arithmetic operations per step (see, e.g., Press et al. 2007). However, 
since A ovcn and A odd may have different columns in the residual blocks of their compressed representations, relation 
(A2) is a better choice in our case. 

For each block in R we precompute aj 1 , [3™ and 7™, as well PJ™ tait {z) and ^™ tart +i( z ) f° r eacn z f° r initial conditions. 
Note that Pp(z) in parts of its domain take values so close to zero that they can not be represented in IEEE double 
precision. However, in these cases P™(z) is always increasing in the direction of increasing £, so we can simply increase 
fcstart correspondingly. In fact, we follow libpsht and assume that the dynamic range of the input data is small enough, 
within each m, that values of P/""(z) smaller than 10~ 30 in magnitude can safely be neglected. As far as possible we 
group together six and six columns with the same fc s t a rt and fc stop , for reasons that will soon become clear. 

For an efficient implementation, the first important point is to make sure the number of loads from cache into 
CPU registers is balanced with the number of floating-point operations. The second is to make sure there are enough 
independent floating-point operations in flight simultaneously, so that operations can be pipelined. Thus, 

• for performing a single transform with one real and one imaginary vector, the values of P™ should never need 
to leave the CPU registers. Rather, we fuse equation (Al) and equation (A2) in the core loop. For multiple 

simultaneous transforms we save P™ to cache, but make sure to process in small batches that easily fit in LI 
cache. 

• we process for several Zj simultaneously. This amortizes the register loads of a™, f3™ and 7™. It also ensures 
that there are multiple independent chains of computation going on so that pipelining works well. 

In the single transform case with one real and one imaginary vector, we do the full summation for six zj at the 
time (when possible). The allocation of the 16 available 128-bit registers, each holding two double-precision numbers, 
then becomes three registers for P™, three for P£t 2 ' three for the auxiliary data a™, (3™ and 7™, six accumulation 
registers for q'(zj), and one work register. The z"j values are, perhaps counter-intuitively, read again from cache in 
each iteration, which conserves three registers and thus enables processing six Zj in each chunk instead of only four 

without register spills. Finally, when the time comes for multiplying P" 1 with ai m , the auxiliary data is no longer 
needed, leaving room for loading ae m . 

On the Intel Xeon system, the routine performs at 6.46 GFLOP/s per core (71% of the theoretical maximum) 
when benchmarked on all the Legendre transforms necessary for a full SHT across 32 cores. The effect of instruction 



(A2) 
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pipelining is evident; reducing the number of columns processed in each iteration from six to four reduces performance 
to 5.69 GFLOP/s (63%), and when only processing two columns at the time, performance is only 4.28 GFLOP/s 
(47%). 

We skip the details for the multiple transform case, but in short, in involves the same sort of blocking performed for 
matrix multiplication, including repacking the input data in blocks. Goto & van de Geijn (2008) provide an excellent 
introduction to blocking techniques. In this case the performance is 5.60 GFLOP/s (62%) per core when performing 
the Legcndrc transforms necessary for ten simultaneous SHTs. 

The considerations above guided the choice of loop structure, which was then implemented in pure C using SSE 
intrinsics. We did not spend much time on optimization, so there should be room for further improvements, in 
particular for the multiple-transform path. 

Data layout 

The butterfly compression algorithm naturally leads to the following code organization for spherical harmonic syn- 
thesis: 

1. Since each m is processed independently, we request input in m-major ordering. Also, for multiple simultaneous 
transforms, the coefficients of each map are interleaved, which is optimal both for the butterfly algorithm and 
the brute-force cropped Legendre transforms. In most places, the real and complex parts of the input can be 
treated as two independent vectors, since A is a real matrix. 

2. Compute all q m (zj) into a 2D array. Since each m is processed independently, this ends up in m-major ordering, 
like the input. 

3. While transposing the q m {zj) array into ring- major ordering, phase-shift and wrap around the coefficients, and 
perform FFTs on each ring. Rings must be processed in small batches in order to avoid loading cache lines 
multiple times. 

A temporary work buffer with size of the same order as the input and output is used for q m {zj). An in-place code 
should be feasible with the use of an in-place transpose. 

A drawback compared to brute- force codes is that q m (zj) needs to first be written to and then read from main 
memory. Here, libpsht is instead able to employ blocking, so that a few rings at the time are completely processed 
before moving on. Our benchmarks do however indicate that this is not a big problem in practice. Also, for cluster 
parallelization using MPI, it would be natural to follow S 2 HAT (Hupca et al. 2010; Szydlarski et al. 2011) in distributing 
the input data by m and the output data by rings, which also leads to a global transpose operation. 

Wavemoth stores the output maps in interleaved order, since FFTW3 is able to deal well with such transforms. The 
libpsht code is able to support any output ordering, although stacked, non-interleaved maps are slightly faster, so that 
is the ordering we use for libpsht in the benchmarks. 
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